An Efficient Dual Approach to 
Distance Metric Learning 

Chunhua Shen, Junae Kim, Fayao Liu, Lei Wang, Anton van den Hengel 



Abstract — Distance metric learning is of fundamental interest 
in machine learning because the distance metric employed can 
significantly affect the performance of many learning methods. 
Quadratic Mahalanobis metric learning is a popular approach 
to the problem, but typically requires solving a semidefinite 
programming (SDP) problem, which is computationally expensive.
Standard interior-point SDP solvers typically have a
complexity of $O(D^{6.5})$ (with $D$ the dimension of input data),
and can thus only practically solve problems exhibiting fewer
than a few thousand variables. Since the number of variables
is $D(D+1)/2$, this implies a limit upon the size of problem that
can practically be solved of around a few hundred dimensions.
The complexity of the popular quadratic Mahalanobis metric 
learning approach thus limits the size of problem to which 
metric learning can be applied. Here we propose a significantly 
more efficient approach to the metric learning problem based 
on the Lagrange dual formulation of the problem. The proposed 
formulation is much simpler to implement, and therefore allows 
much larger Mahalanobis metric learning problems to be solved. 
The time complexity of the proposed method is $O(D^3)$, which is
significantly lower than that of the SDP approach. Experiments 
on a variety of datasets demonstrate that the proposed method 
achieves an accuracy comparable to the state-of-the-art, but is 
applicable to significantly larger problems. We also show that the 
proposed method can be applied to solve more general Frobenius- 
norm regularized SDP problems approximately. 

Index Terms — Mahalanobis distance, metric learning, semidef- 
inite programming, convex optimization, Lagrange duality. 
I. Introduction

Distance metric learning has attracted a lot of research
interest recently in the machine learning and pattern recognition
community due to its wide applications in various
areas [1]-[4]. Methods relying upon the identification of an
appropriate data-dependent distance metric have been applied
to a range of problems, from image classification and object
recognition, to the analysis of genomes. The performance of
many classic algorithms such as $k$-nearest neighbor ($k$NN) and
$k$-means clustering depends critically upon the distance metric
employed.

Large-margin metric learning is an approach which focuses
on identifying a metric by which the data points within the
same class lie close to each other and those in different
classes are separated by a large margin. Weinberger et al.'s
large-margin nearest neighbor (LMNN) [1] is a seminal work
illustrating the approach whereby the metric takes the form
of a Mahalanobis distance. Given input data $\mathbf{a} \in \mathbb{R}^D$, this
approach to the metric learning problem can be framed as
that of learning the linear transformation $\mathbf{L}$ which optimizes
a criterion expressed in terms of Euclidean distances amongst
the projected data $\mathbf{L}^\top \mathbf{a} \in \mathbb{R}^d$.

In order to obtain a convex problem, instead of learning
the projection matrix ($\mathbf{L} \in \mathbb{R}^{D \times d}$), one usually optimizes over
the quadratic product of the projection matrix ($\mathbf{X} = \mathbf{L}\mathbf{L}^\top$)
[1], [3]. This linearization convexifies the original non-convex
problem. The projection matrix may then be recovered by an
eigen-decomposition or Cholesky decomposition of $\mathbf{X}$.
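
As a small illustration of the recovery step, the sketch below factors a p.s.d. matrix by eigen-decomposition and checks that the Mahalanobis distance under $\mathbf{X}$ equals the squared Euclidean distance after projection. The helper name `factor_psd`, the tolerance, and the random test data are assumptions made for this example only; they are not taken from the paper's implementation.

```python
import numpy as np

def factor_psd(X, tol=1e-10):
    """Return L with X ~= L @ L.T, keeping only eigenvalues above tol."""
    # Symmetrize first to guard against small numerical asymmetries.
    w, U = np.linalg.eigh((X + X.T) / 2.0)
    keep = w > tol
    # Scale each retained eigenvector by the square root of its eigenvalue.
    return U[:, keep] * np.sqrt(w[keep])

rng = np.random.default_rng(0)
B = rng.standard_normal((5, 2))
X = B @ B.T                      # a rank-2 p.s.d. matrix with D = 5
L = factor_psd(X)                # shape (D, d) with d = 2

a, b = rng.standard_normal(5), rng.standard_normal(5)
d_metric = (a - b) @ X @ (a - b)              # (a-b)^T X (a-b)
d_proj = np.sum((L.T @ a - L.T @ b) ** 2)     # ||L^T a - L^T b||^2
```

A Cholesky factorization would serve the same purpose when $\mathbf{X}$ is strictly positive definite; the eigen-decomposition route also handles rank-deficient solutions.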

Typical methods that learn the projection matrix $\mathbf{L}$ include
most spectral dimensionality reduction methods, such
as principal component analysis (PCA) and Fisher linear
discriminant analysis (LDA), as well as neighborhood component
analysis (NCA) [5] and relevant component analysis (RCA) [6].
Goldberger et al. showed that NCA may outperform
traditional dimensionality reduction methods [5]. NCA learns
the projection matrix directly through optimization of a
non-convex objective function. NCA is thus prone to becoming
trapped in local optima, particularly when applied to
high-dimensional problems. RCA [6] is an unsupervised metric
learning method. RCA does not maximize the distance
between different classes, but minimizes the distance between
data in chunklets. Chunklets consist of data that come from
the same (although unknown) class.

Since Xing et al. [3] proposed a global distance metric
learning approach using a convex optimization method, more
large-margin metric learning methods have learned $\mathbf{X}$ directly.
Although the experiments in [3] show
improved performance on clustering problems, this is not
the case when the method is applied to most classification
problems. Davis et al. [7] proposed an information theoretic
metric learning (ITML) approach to the problem. The
closest work to ours may be LMNN [1] and BoostMetric
[8]. LMNN is a Mahalanobis metric form of $k$-NN whereby
the Mahalanobis metric is optimized such that the $k$-nearest
neighbors are encouraged to belong to the same class while
data points from different classes are separated by a large
margin. The optimization takes the form of an SDP problem.



In order to improve the scalability of the algorithm, instead
of using standard SDP solvers, Weinberger et al. [1] proposed
an alternating estimation and projection method. At
each iteration, the updated estimate $\mathbf{X}$ is projected back to
the semidefinite cone using eigen-decomposition, in order to
preserve the semi-definiteness of $\mathbf{X}$. In this sense, at each
iteration, the computational complexity of their algorithm is
similar to that of ours. However, the alternating method needs
an extremely large number of iterations to converge (the
default value being 10,000 in the authors' implementation).
In contrast, our algorithm solves the corresponding Lagrange
dual problem and needs only 20 to 30 iterations in most cases.
In addition, the algorithm we propose is significantly easier to
implement.

As pointed out in this earlier work, the disadvantage of solving
for $\mathbf{X}$ is that one needs to solve a semidefinite programming
(SDP) problem since $\mathbf{X}$ must be positive semidefinite (p.s.d.).
Conventional interior-point SDP solvers have a computational
complexity of $O(D^{6.5})$, with $D$ the dimension of input data.
This high complexity hampers the application of metric learning
to high-dimensional problems.

To tackle this problem, we propose here a new
formulation of quadratic Mahalanobis metric learning using
proximity comparison information and Frobenius norm
regularization. The main contribution is that, with the proposed
formulation, we can very efficiently solve the SDP problem
in the dual space. Because strong duality holds, we can then
recover the primal variable $\mathbf{X}$ from the dual solution. The
computational complexity of the optimization is dominated
by eigen-decomposition, which is $O(D^3)$, and thus the overall
complexity is $O(t \cdot D^3)$, where $t$ is the number of iterations
required for convergence. Note that $t$ does not depend on the
size of the data, and is typically $t \approx 20 \sim 30$.
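
To get a feel for these orders of growth, the following back-of-the-envelope comparison takes $D = 100$ and $t = 30$ as illustrative values. Both values are assumptions chosen for the example, and constant factors are ignored, so the final ratio is only indicative.

```latex
\begin{align*}
\text{number of variables in } \mathbf{X}:\quad
  & \frac{D(D+1)}{2} = \frac{100 \cdot 101}{2} = 5050, \\
\text{interior-point SDP}:\quad
  & D^{6.5} = 100^{6.5} = 10^{13}, \\
\text{dual approach}:\quad
  & t \cdot D^{3} = 30 \cdot 10^{6} = 3 \times 10^{7}, \\
\text{ratio}:\quad
  & \frac{10^{13}}{3 \times 10^{7}} \approx 3.3 \times 10^{5}.
\end{align*}
```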

A number of methods exist in the literature for large-scale
p.s.d. metric learning. Shen et al. [8], [9] introduced
BoostMetric by adapting the boosting technique, typically 
applied to classification, to distance metric learning. This 
work exploits an important theorem which shows that a 
positive semidefinite matrix with trace of one can always be 
represented as a convex combination of multiple rank-one 
matrices. The work of Shen et al. generalized LPBoost [10]
and AdaBoost by showing that it is possible to use matrices as 
weak learners within these algorithms, in addition to the more 
traditional use of classifiers or regressors as weak learners. 
The approach we propose here, FrobMetric, is inspired by 
BoostMetric in the sense that both algorithms use proximity 
comparisons between triplets as the source of the training 
information. The critical distinction between FrobMetric and
BoostMetric, however, is that reformulating the problem to
use Frobenius norm regularization, rather than trace norm
regularization, allows the development of a dual form of the
resulting optimization problem which may be solved far more
efficiently. The BoostMetric approach iteratively computes
the squared Mahalanobis distance metric using a rank-one
update at each iteration. This has the advantage that only the
leading eigenvector need be calculated, but leads to slower
convergence. Indeed, for BoostMetric, the convergence rate
remains unclear. The proposed FrobMetric method, in contrast,
requires more calculations per iteration, but converges in
significantly fewer iterations. In our implementation,
the convergence rate of FrobMetric is guaranteed by the
quasi-Newton method employed.

The main contributions of this work are as follows.
1) We propose a novel formulation of the metric learning
problem, based on the application of Frobenius norm
regularization.

2) We develop a method for solving this formulation of the
problem which is based on optimizing its Lagrange dual.
This method may be practically applied to much more
complex datasets than the competing SDP approach, as it
scales better to large databases and to high dimensional
data.

3) We generalize the method such that it may be used
to solve any Frobenius norm regularized SDP problem.
Such problems have many applications in machine learning
and computer vision, and by way of example we
show that it may be used to approximately solve the
Frobenius norm perturbed maximum variance unfolding
(MVU) problem [11]. We demonstrate that the proposed
method is considerably more efficient than the original
MVU implementation on a variety of data sets and that
a plausible embedding is obtained.

The proposed scalable semidefinite optimization method can
be viewed as an extension of the work of Boyd and Xiao [12].
The subject of Boyd and Xiao in [12] was similarly a semidefinite
least-squares problem: finding the covariance matrix that
is closest to a given matrix under the Frobenius norm metric.
Here we study the large-margin Mahalanobis metric learning
problem, where, in contrast, the objective function is not a
least squares fitting problem. We also discuss, in Section III, the
application of the proposed approach to general SDP problems
which have Frobenius norm regularization terms. Note also
that a precursor to the approach described here appeared
in [13]. Here we have provided more theoretical analysis as
well as experimental results.

In summary, we propose a simple, efficient and scalable op- 
timization method for quadratic Mahalanobis metric learning. 
The formulated optimization problem is convex, thus guaran- 
teeing that the global optimum can be attained in polynomial 
time [14]. Moreover, by working with the Lagrange dual
problem, we are able to use off-the-shelf eigen-decomposition 
and gradient descent methods such as L-BFGS-B to solve the 
problem. 

A. Notation 

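A column vector is denoted by a bold lower-case letter ($\mathbf{x}$)
and a matrix by a bold upper-case letter ($\mathbf{X}$). The fact that
a matrix $\mathbf{A}$ is positive semidefinite (p.s.d.) is denoted
$\mathbf{A} \succeq 0$. The inequality $\mathbf{A} \succeq \mathbf{B}$ is intended to indicate that
$\mathbf{A} - \mathbf{B} \succeq 0$. In the case of vectors, $\mathbf{a} \geq \mathbf{b}$ denotes the
element-wise version of the inequality, and when applied relative to
a scalar (e.g., $\mathbf{a} \geq 0$) the inequality is intended to apply
to every element of the vector. For matrices, we denote by
$\mathbb{R}^{m \times n}$ the vector space of real matrices of size $m \times n$, and
the space of real symmetric matrices as $\mathbb{S}$. Similarly, the space
of symmetric matrices of size $n \times n$ is $\mathbb{S}^n$, and the space
of symmetric positive semidefinite matrices of size $n \times n$ is
denoted as $\mathbb{S}^n_+$. The inner product defined on these spaces
is $\langle \mathbf{A}, \mathbf{B} \rangle = \mathrm{Tr}(\mathbf{A}^\top \mathbf{B})$. Here $\mathrm{Tr}(\cdot)$ calculates the trace of a
matrix. The Frobenius norm of a matrix is defined by
$\|\mathbf{X}\|_F^2 = \mathrm{Tr}(\mathbf{X}\mathbf{X}^\top) = \mathrm{Tr}(\mathbf{X}^\top\mathbf{X})$, which is the sum of all the squared
elements of $\mathbf{X}$. $\mathrm{diag}(\cdot)$ extracts the diagonal elements of a
square matrix. Given a symmetric matrix $\mathbf{X}$ and its
eigen-decomposition $\mathbf{X} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{U}^\top$ ($\mathbf{U}$ being an orthonormal matrix,
and $\boldsymbol{\Sigma}$ being real and diagonal), we define the positive part of
$\mathbf{X}$ as

$$(\mathbf{X})_+ = \mathbf{U} \left[ \max(\mathrm{diag}(\boldsymbol{\Sigma}), 0) \right] \mathbf{U}^\top,$$

and the negative part of $\mathbf{X}$ as

$$(\mathbf{X})_- = \mathbf{U} \left[ \min(\mathrm{diag}(\boldsymbol{\Sigma}), 0) \right] \mathbf{U}^\top.$$

Clearly, $\mathbf{X} = (\mathbf{X})_+ + (\mathbf{X})_-$ holds.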

B. Euclidean Projection onto the p.s.d. cone 

Our proposed method relies on the following standard
results, which can be found in textbooks such as Chapter
8 of [14]. The positive semidefinite part $(\mathbf{X})_+$ of $\mathbf{X}$ is the
projection of $\mathbf{X}$ onto the p.s.d. cone:

$$(\mathbf{X})_+ = \arg\min_{\mathbf{Y}} \left\{ \|\mathbf{Y} - \mathbf{X}\|_F^2, \ \text{s.t. } \mathbf{Y} \succeq 0 \right\}. \tag{1}$$

It is not difficult to check that, for any $\mathbf{Y} \succeq 0$,

$$\|\mathbf{X} - (\mathbf{X})_+\|_F^2 = \|(\mathbf{X})_-\|_F^2 \leq \|\mathbf{X} - \mathbf{Y}\|_F^2.$$

In other words, although the optimization problem in (1)
appears to be an SDP problem, it can be simply
solved by using eigen-decomposition, which is efficient. It
is this key observation that serves as the backbone of the
proposed fast method.
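
The projection result can be checked numerically. The sketch below computes the positive and negative parts by eigen-decomposition and compares the distance to $(\mathbf{X})_+$ with the distance to a number of random p.s.d. candidates. The helper name `psd_parts`, the matrix size, and the random candidates are assumptions made for this illustration.

```python
import numpy as np

def psd_parts(X):
    """Split a symmetric matrix into its positive and negative parts."""
    w, U = np.linalg.eigh(X)
    pos = (U * np.maximum(w, 0.0)) @ U.T   # U diag(max(w, 0)) U^T
    neg = (U * np.minimum(w, 0.0)) @ U.T   # U diag(min(w, 0)) U^T
    return pos, neg

rng = np.random.default_rng(1)
M = rng.standard_normal((6, 6))
X = (M + M.T) / 2.0                        # random symmetric, indefinite
X_pos, X_neg = psd_parts(X)
dist_to_projection = np.linalg.norm(X - X_pos, "fro") ** 2

# Distance from X to random p.s.d. matrices Y = B B^T, minus the
# distance to the projection; every gap should be non-negative.
gaps = []
for _ in range(100):
    B = rng.standard_normal((6, 6))
    Y = B @ B.T
    gaps.append(np.linalg.norm(X - Y, "fro") ** 2 - dist_to_projection)
```

The only expensive operation is the single call to `np.linalg.eigh`, which is what makes the projection cheap compared with a generic SDP solve.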

The rest of the paper is organized as follows. In Section
II we present the main algorithm for learning a Mahalanobis
metric using efficient optimization. In Section III we extend
our algorithm to more general Frobenius norm regularized
semidefinite problems. The experiments on various datasets
are shown in Section IV and we conclude the paper in Section
V.

II. Large-margin Distance Metric Learning 

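We now briefly review quadratic Mahalanobis distance
metrics. Suppose that we have a set of triplets
$\mathcal{S} = \{(\mathbf{a}_i, \mathbf{a}_j, \mathbf{a}_k)\}$, which encodes proximity comparison
information. Suppose also that $\mathrm{dist}_{ij}$ computes the Mahalanobis
distance between $\mathbf{a}_i$ and $\mathbf{a}_j$ under a proper Mahalanobis matrix.
That is, $\mathrm{dist}_{ij} = \|\mathbf{a}_i - \mathbf{a}_j\|_{\mathbf{X}}^2 = (\mathbf{a}_i - \mathbf{a}_j)^\top \mathbf{X} (\mathbf{a}_i - \mathbf{a}_j)$, where
$\mathbf{X} \in \mathbb{S}_+^{D \times D}$ is positive semidefinite. Such a Mahalanobis
metric may equally be parameterized by a projection matrix
$\mathbf{L}$ where $\mathbf{X} = \mathbf{L}\mathbf{L}^\top$.

Let us define the margin associated with a training triplet
as $\rho_r = (\mathbf{a}_i - \mathbf{a}_k)^\top \mathbf{X} (\mathbf{a}_i - \mathbf{a}_k) - (\mathbf{a}_i - \mathbf{a}_j)^\top \mathbf{X} (\mathbf{a}_i - \mathbf{a}_j) = \langle \mathbf{A}_r, \mathbf{X} \rangle$,
with $\mathbf{A}_r = (\mathbf{a}_i - \mathbf{a}_k)(\mathbf{a}_i - \mathbf{a}_k)^\top - (\mathbf{a}_i - \mathbf{a}_j)(\mathbf{a}_i - \mathbf{a}_j)^\top$.
Here $r$ represents the index of the current triplet within the set
of $m$ training triplets $\mathcal{S}$. As will be shown in the experiments
below, this type of proximity comparison among triplets may
be easier to obtain than explicit distances for some applications
like image retrieval. Here the metric learning procedure solely
relies on the matrices $\mathbf{A}_r$ ($r = 1, \dots, m$).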
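
The construction of $\mathbf{A}_r$ and the margin can be made concrete with a few lines of code. This is a minimal sketch; the function names `triplet_matrix` and `margin` and the two-dimensional toy points are hypothetical and serve only to illustrate the definitions above.

```python
import numpy as np

def triplet_matrix(a_i, a_j, a_k):
    """A_r = (a_i - a_k)(a_i - a_k)^T - (a_i - a_j)(a_i - a_j)^T."""
    d_ik = a_i - a_k
    d_ij = a_i - a_j
    return np.outer(d_ik, d_ik) - np.outer(d_ij, d_ij)

def margin(A_r, X):
    """Margin rho_r = <A_r, X> = Tr(A_r^T X)."""
    return np.sum(A_r * X)

# Toy triplet: a_j should be close to a_i, a_k should be far from a_i.
a_i = np.array([0.0, 0.0])
a_j = np.array([1.0, 0.0])
a_k = np.array([0.0, 3.0])
A_r = triplet_matrix(a_i, a_j, a_k)

rho_euclid = margin(A_r, np.eye(2))             # 9 - 1 = 8
rho_first_axis = margin(A_r, np.diag([1.0, 0.0]))  # 0 - 1 = -1
```

Under the Euclidean metric the triplet has a positive margin, whereas a metric that only measures the first coordinate violates it. The learning problem searches for a p.s.d. $\mathbf{X}$ that makes such margins large across all triplets.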

A. Primal problems of Mahalanobis metric learning

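Putting it into the large-margin learning framework, the
optimization problem is to maximize the margin with a
regularization term that is intended to avoid over-fitting (or, in
some cases, makes the problem well-posed):

$$\begin{aligned}
\max_{\mathbf{X}, \boldsymbol{\xi}, \rho} \ & \rho - \frac{C_1}{m} \sum_{r=1}^{m} \xi_r \\
\text{s.t.:} \ & \langle \mathbf{A}_r, \mathbf{X} \rangle \geq \rho - \xi_r, \ r = 1, \dots, m, \\
& \boldsymbol{\xi} \geq 0, \ \rho \geq 0, \ \mathrm{Tr}(\mathbf{X}) = 1, \ \mathbf{X} \succeq 0.
\end{aligned} \tag{P1}$$

Here $\mathrm{Tr}(\mathbf{X}) = 1$ removes the scale ambiguity in $\mathbf{X}$. This is
the formulation proposed in BoostMetric [8]. We can write the
above problem equivalently as

$$\begin{aligned}
\min_{\mathbf{X}, \boldsymbol{\xi}} \ & \mathrm{Tr}(\mathbf{X}) + \frac{C_2}{m} \sum_{r=1}^{m} \xi_r \\
\text{s.t.:} \ & \langle \mathbf{A}_r, \mathbf{X} \rangle \geq 1 - \xi_r, \ r = 1, \dots, m, \\
& \boldsymbol{\xi} \geq 0, \ \mathbf{X} \succeq 0.
\end{aligned} \tag{P2}$$

These formulations are exactly equivalent given the appropriate
choice of the trade-off parameters $C_1$ and $C_2$. The theorem
is as follows.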



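Theorem 1. A solution of (P1), $\mathbf{X}^\star$, is also a solution of (P2)
and vice versa up to a scale factor.

More precisely, if (P1) with parameter $C_1$ has a solution
$(\mathbf{X}^\star, \boldsymbol{\xi}^\star, \rho^\star > 0)$, then $(\mathbf{X}^\star/\rho^\star, \boldsymbol{\xi}^\star/\rho^\star)$ is the solution of (P2) with
parameter $C_2 = C_1 / \mathrm{Opt}(\mathrm{P1})$. Here $\mathrm{Opt}(\mathrm{P1})$ is the optimal
objective value of (P1).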

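Proof: It is well known that the necessary and sufficient
conditions for the optimality of SDP problems are primal
feasibility, dual feasibility, and equality of the primal and dual
objectives. We can easily derive the duals of (P1) and (P2),
respectively:

$$\begin{aligned}
\min_{\gamma, \mathbf{u}} \ & \gamma \\
\text{s.t.:} \ & \sum_{r=1}^{m} u_r \mathbf{A}_r \preceq \gamma \mathbf{I}; \ \mathbf{1}^\top \mathbf{u} = 1; \ 0 \leq \mathbf{u} \leq \frac{C_1}{m},
\end{aligned} \tag{D1}$$

and

$$\begin{aligned}
\max_{\mathbf{u}} \ & \sum_{r=1}^{m} u_r \\
\text{s.t.:} \ & \sum_{r=1}^{m} u_r \mathbf{A}_r \preceq \mathbf{I}; \ 0 \leq \mathbf{u} \leq \frac{C_2}{m}.
\end{aligned} \tag{D2}$$

Here $\mathbf{I}$ is the identity matrix.

Let $(\mathbf{X}^\star, \boldsymbol{\xi}^\star, \rho^\star)$ represent the optimum of the primal
problem (P1). Primal feasibility of (P1) implies primal feasibility
of (P2), and thus that

$$\left\langle \mathbf{A}_r, \frac{\mathbf{X}^\star}{\rho^\star} \right\rangle \geq 1 - \frac{\xi_r^\star}{\rho^\star}, \quad r = 1, \dots, m.$$

Let $(\gamma^\star, \mathbf{u}^\star)$ be the optimal solution of the dual problem
(D1). Dual feasibility of (D1) implies dual feasibility of
(D2), and thus that $\sum_r (u_r^\star / \gamma^\star) \mathbf{A}_r \preceq \mathbf{I}$, and
$0 \leq \mathbf{u}^\star / \gamma^\star \leq C_1 / (m \gamma^\star) = C_2 / m$. Since the duality gap between (P1) and (D1) is
zero, $\mathrm{Opt}(\mathrm{D1}) = \gamma^\star = \mathrm{Opt}(\mathrm{P1})$.

Last we need to show that the objective function values of
(P2) and (D2) are the same. This is easy to verify from the
fact that $\mathrm{Opt}(\mathrm{P1}) = \mathrm{Opt}(\mathrm{D1})$, which, using $\mathbf{1}^\top \mathbf{u}^\star = 1$ and
$\mathrm{Tr}(\mathbf{X}^\star) = 1$, can be written as the first line below. Dividing
both sides by $\rho^\star \gamma^\star$ and substituting $C_2 = C_1 / \gamma^\star$ gives the
second line:

$$\begin{aligned}
\rho^\star (\mathbf{1}^\top \mathbf{u}^\star) - \frac{C_1}{m} \mathbf{1}^\top \boldsymbol{\xi}^\star &= \gamma^\star \, \mathrm{Tr}(\mathbf{X}^\star), \\
\frac{\mathrm{Tr}(\mathbf{X}^\star)}{\rho^\star} + \frac{C_2}{m} \frac{\mathbf{1}^\top \boldsymbol{\xi}^\star}{\rho^\star} &= \frac{\mathbf{1}^\top \mathbf{u}^\star}{\gamma^\star}.
\end{aligned}$$

This concludes the proof. ■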

Both problems can be written in the form of standard SDP
problems since the objective function is linear and a p.s.d.
constraint is involved. Recall that we are interested in a Frobenius
norm regularization rather than a trace norm regularization. The
key observation is that the Frobenius norm regularization term
leads to a simple and scalable optimization. So replacing the
trace norm in (P2) with the Frobenius norm we have:

$$\begin{aligned}
\min_{\mathbf{X}, \boldsymbol{\xi}} \ & \frac{1}{2} \|\mathbf{X}\|_F^2 + \frac{C_3}{m} \sum_{r=1}^{m} \xi_r \\
\text{s.t.:} \ & \langle \mathbf{A}_r, \mathbf{X} \rangle \geq 1 - \xi_r, \ r = 1, \dots, m, \\
& \boldsymbol{\xi} \geq 0, \ \mathbf{X} \succeq 0.
\end{aligned} \tag{P3}$$

Although (P3) and (P2) are not exactly the same, the only
difference is the regularization term. Different regularizations
can lead to different solutions. However, as with the $\ell_1$ and $\ell_2$
norm regularizations in the case of vector variables, such as
in support vector machines (SVM), in general these two
regularizations would perform similarly in terms of the final
classification accuracy. Here, one does not expect that a
particular form of regularization, either the trace or Frobenius
norm regularization, would perform better than the other one.
As we have pointed out, the advantage of the Frobenius norm
is faster optimization.

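One may convert (P3) into a standard SDP problem by
introducing an auxiliary variable $\delta$:

$$\begin{aligned}
\min_{\mathbf{X}, \boldsymbol{\xi}, \delta} \ & \delta + \frac{C_3}{m} \sum_{r=1}^{m} \xi_r \\
\text{s.t.:} \ & \langle \mathbf{A}_r, \mathbf{X} \rangle \geq 1 - \xi_r, \ r = 1, \dots, m, \\
& \boldsymbol{\xi} \geq 0, \ \mathbf{X} \succeq 0; \ \tfrac{1}{2} \|\mathbf{X}\|_F^2 \leq \delta.
\end{aligned}$$

The last constraint can be formulated as a p.s.d. constraint

$$\begin{bmatrix} \mathbf{I} & \mathrm{vec}(\mathbf{X}) \\ \mathrm{vec}(\mathbf{X})^\top & 2\delta \end{bmatrix} \succeq 0,$$

where $\mathrm{vec}(\mathbf{X})$ stacks the entries of $\mathbf{X}$ into a column vector.
In theory, we can use an off-the-shelf
SDP solver to solve this primal problem directly. However,
as mentioned previously, the computational complexity of this
approach is very high, meaning that only small-scale problems
can be solved within reasonable CPU time limits.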

Next, we show that the Lagrange dual problem of (P3) has
some desirable properties.



B. Dual problems and desirable properties 

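We first introduce the Lagrangian dual multipliers: $\mathbf{Z}$, which
we associate with the p.s.d. constraint $\mathbf{X} \succeq 0$, and $\mathbf{u}$ and $\mathbf{p}$,
which we associate with the remaining constraints involving $\boldsymbol{\xi}$
(the margin constraints and $\boldsymbol{\xi} \geq 0$, respectively).
The Lagrangian of (P3) then becomes

$$L(\underbrace{\mathbf{X}, \boldsymbol{\xi}}_{\text{primal}}, \underbrace{\mathbf{Z}, \mathbf{u}, \mathbf{p}}_{\text{dual}}) = \frac{1}{2} \|\mathbf{X}\|_F^2 + \frac{C_3}{m} \sum_{r=1}^{m} \xi_r - \sum_{r} u_r \langle \mathbf{A}_r, \mathbf{X} \rangle + \sum_{r} u_r - \sum_{r} u_r \xi_r - \mathbf{p}^\top \boldsymbol{\xi} - \langle \mathbf{X}, \mathbf{Z} \rangle$$

with $\mathbf{u} \geq 0$, $\mathbf{p} \geq 0$ and $\mathbf{Z} \succeq 0$. We need to minimize the Lagrangian
over $\mathbf{X}$ and $\boldsymbol{\xi}$, which can be done by setting the first derivative
to zero, from which we see that

$$\mathbf{X}^\star = \mathbf{Z}^\star + \sum_{r=1}^{m} u_r^\star \mathbf{A}_r, \tag{2}$$

and $\frac{C_3}{m} \geq \mathbf{u} \geq 0$. Substituting the expression for $\mathbf{X}$ back into
the Lagrangian, we obtain the dual formulation:

$$\begin{aligned}
\max_{\mathbf{Z}, \mathbf{u}} \ & \sum_{r=1}^{m} u_r - \frac{1}{2} \Big\| \mathbf{Z} + \sum_{r=1}^{m} u_r \mathbf{A}_r \Big\|_F^2 \\
\text{s.t.:} \ & \frac{C_3}{m} \geq \mathbf{u} \geq 0, \ \mathbf{Z} \succeq 0.
\end{aligned} \tag{D3}$$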
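
The substitution step that leads to (D3) can be spelled out as follows. This is a sketch of the algebra under the assumptions that the trade-off constant of (P3) is written $C_3$ and that $\mathbf{p} \geq 0$ denotes the multiplier of the constraint $\boldsymbol{\xi} \geq 0$; these symbol choices are made here for the illustration.

```latex
\begin{align*}
& \text{Stationarity in } \xi_r:\quad
  \tfrac{C_3}{m} - u_r - p_r = 0
  \ \Rightarrow\ \text{all terms in } \xi_r \text{ cancel, and }
  0 \le u_r \le \tfrac{C_3}{m} \text{ since } p_r \ge 0. \\
& \text{Stationarity in } \mathbf{X}:\quad
  \mathbf{X}^\star = \mathbf{Z} + \textstyle\sum_r u_r \mathbf{A}_r. \\
& \text{Remaining terms:}\quad
  \tfrac12 \|\mathbf{X}^\star\|_F^2
  - \big\langle \mathbf{X}^\star,\ \mathbf{Z} + \textstyle\sum_r u_r \mathbf{A}_r \big\rangle
  + \textstyle\sum_r u_r \\
& \qquad = \tfrac12 \|\mathbf{X}^\star\|_F^2 - \|\mathbf{X}^\star\|_F^2 + \textstyle\sum_r u_r
  = \textstyle\sum_r u_r
  - \tfrac12 \Big\| \mathbf{Z} + \textstyle\sum_r u_r \mathbf{A}_r \Big\|_F^2 .
\end{align*}
```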

This dual problem still has a p.s.d. constraint and it is not clear
how it may be solved more efficiently than by using standard
interior-point methods. Note, however, that both the primal
and dual problems are convex, and Slater's condition holds under
mild conditions (see [14] for details). Strong duality thus holds
between (P3) and (D3), which means that the objective values
of these two problems coincide at optimality and in many cases
we are able to indirectly solve the primal by solving the dual
and vice versa. The Karush-Kuhn-Tucker (KKT) conditions
(2) thus enable us to recover $\mathbf{X}^\star$, which is the primal variable
of interest, from the dual solution.

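Given a fixed $\mathbf{u}$, the dual problem (D3) may be simplified to

$$\min_{\mathbf{Z}} \ \Big\| \mathbf{Z} + \sum_{r=1}^{m} u_r \mathbf{A}_r \Big\|_F^2, \quad \text{s.t.: } \mathbf{Z} \succeq 0. \tag{3}$$

To simplify the notation we define $\hat{\mathbf{A}}$ as a function of $\mathbf{u}$:

$$\hat{\mathbf{A}} = - \sum_{r=1}^{m} u_r \mathbf{A}_r.$$

Problem (3) then becomes that of finding the p.s.d. matrix $\mathbf{Z}$
such that $\|\mathbf{Z} - \hat{\mathbf{A}}\|_F^2$ is minimized. This problem has a
closed-form solution, which is the positive part of $\hat{\mathbf{A}}$:

$$\mathbf{Z}^\star = (\hat{\mathbf{A}})_+. \tag{4}$$

Now the original dual problem may be simplified to

$$\max_{\mathbf{u}} \ \sum_{r=1}^{m} u_r - \frac{1}{2} \|(\hat{\mathbf{A}})_-\|_F^2, \quad \text{s.t.: } \frac{C_3}{m} \geq \mathbf{u} \geq 0. \tag{5}$$

The KKT condition is simplified into

$$\mathbf{X}^\star = (\hat{\mathbf{A}})_+ - \hat{\mathbf{A}} = -(\hat{\mathbf{A}})_-. \tag{6}$$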

From the definition of the operator $(\cdot)_-$, $\mathbf{X}^\star$ computed by (6)
must be p.s.d. Note that we have now achieved a simplified
dual problem which has no matrix variables, and only simple
box constraints on $\mathbf{u}$. The fact that the objective function of
(5) is differentiable (but not twice differentiable) allows us
to optimize for $\mathbf{u}$ in (5) using gradient descent methods (see
Sect. 5.2 in [15]). To illustrate why the objective function is
differentiable, we can see the following simple example. For
$F(\mathbf{X}) = \frac{1}{2} \|(\mathbf{X})_-\|_F^2$, the gradient can be calculated as

$$\nabla F(\mathbf{X}) = (\mathbf{X})_-,$$

because of the following fact. Given a symmetric $\delta\mathbf{X}$, we have

$$F(\mathbf{X} + \delta\mathbf{X}) = F(\mathbf{X}) + \mathrm{Tr}(\delta\mathbf{X} \, (\mathbf{X})_-) + o(\delta\mathbf{X}).$$

This can be verified by using the perturbation theory of
eigenvalues of symmetric matrices. When we set $\delta\mathbf{X}$ to be
very small, the above equality is the definition of gradient.

Hence, we can use a sophisticated off-the-shelf quasi-Newton
algorithm that requires only first-order (gradient) information,
such as L-BFGS-B [16], to solve (5). In
summary, the optimization procedure is as follows.

1) Input the training triplets and calculate $\mathbf{A}_r$, $r = 1, \dots, m$.

2) Calculate the gradient of the objective function in (5),
and use L-BFGS-B to optimize (5).

3) Calculate $\hat{\mathbf{A}}$ using the output of L-BFGS-B (namely, $\mathbf{u}^\star$)
and compute $\mathbf{X}^\star$ from (6) using eigen-decomposition.

To implement this approach, one only needs to implement the
callback function of L-BFGS-B, which computes the gradient
of the objective function of (5). Note that other gradient
methods such as conjugate gradients may be preferred when
the number of constraints (i.e., the size of the training triplet set,
$m$) is large. The gradient of dual problem (5) can be calculated
as

$$g(u_r) = 1 + \langle (\hat{\mathbf{A}})_-, \mathbf{A}_r \rangle, \quad r = 1, \dots, m.$$

So, at each iteration, $(\hat{\mathbf{A}})_-$, which requires a
full eigen-decomposition, need only be calculated once in
order to evaluate all of the gradients, as well as the function
value. When the number of constraints is not far more than the
dimensionality of the data, eigen-decomposition dominates the
computational complexity at each iteration. In this case, the
overall complexity is $O(t \cdot D^3)$ with $t$ being around 20 to 30.
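
The three-step procedure above can be sketched in a few lines. This is a minimal illustration rather than the authors' Matlab/Fortran implementation: the function names are hypothetical, the triplet matrices are assumed to be stacked in an array `A` of shape `(m, D, D)`, the trade-off constant of (P3) is passed as `C`, and SciPy's `minimize` with `method="L-BFGS-B"` stands in for the L-BFGS-B interface used in the paper. Since `minimize` minimizes, the callback returns the negated dual objective and gradient.

```python
import numpy as np

def negative_part(S):
    """(S)_- = U diag(min(w, 0)) U^T for a symmetric matrix S."""
    w, U = np.linalg.eigh(S)
    return (U * np.minimum(w, 0.0)) @ U.T

def neg_dual_and_grad(u, A):
    """Negated objective of the simplified dual and its gradient.

    A has shape (m, D, D) and holds the triplet matrices A_r.
    """
    A_hat = -np.tensordot(u, A, axes=1)        # A_hat = -sum_r u_r A_r
    A_neg = negative_part(A_hat)               # one eigen-decomposition
    f = u.sum() - 0.5 * np.sum(A_neg ** 2)     # dual objective value
    # g_r = 1 + <(A_hat)_-, A_r> for every r at once
    g = 1.0 + np.tensordot(A, A_neg, axes=([1, 2], [0, 1]))
    return -f, -g

def frobmetric_sketch(A, C=1.0):
    """Solve the box-constrained dual, then recover X = -(A_hat)_-."""
    from scipy.optimize import minimize        # assumed to be installed
    m = A.shape[0]
    res = minimize(neg_dual_and_grad, np.zeros(m), args=(A,), jac=True,
                   method="L-BFGS-B", bounds=[(0.0, C / m)] * m)
    X = -negative_part(-np.tensordot(res.x, A, axes=1))
    return X, res.x
```

The callback performs exactly one eigen-decomposition per evaluation and reuses it for both the function value and all $m$ gradient entries, which is the property the complexity argument above relies on.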

III. General Frobenius Norm SDP 

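In this section, we generalize the proposed idea to a broader
setting. The general formulation of an SDP problem reads:

$$\min_{\mathbf{X}} \ \langle \mathbf{C}, \mathbf{X} \rangle, \quad \text{s.t. } \mathbf{X} \succeq 0, \ \langle \mathbf{A}_i, \mathbf{X} \rangle \leq b_i, \ i = 1, \dots, m.$$

We consider its Frobenius norm regularized version:

$$\min_{\mathbf{X}} \ \langle \mathbf{C}, \mathbf{X} \rangle + \frac{1}{2\sigma} \|\mathbf{X}\|_F^2, \quad \text{s.t. } \mathbf{X} \succeq 0, \ \langle \mathbf{A}_i, \mathbf{X} \rangle \leq b_i, \ \forall i.$$

Here $\sigma$ is a regularization constant. We start by deriving the
Lagrange dual of this Frobenius norm regularized SDP. The
dual problem is

$$\min_{\mathbf{Z}, \mathbf{u}} \ \frac{1}{2} \sigma \|\mathbf{Z} - \hat{\mathbf{A}} - \mathbf{C}\|_F^2 + \mathbf{b}^\top \mathbf{u}, \quad \text{s.t. } \mathbf{Z} \succeq 0, \ \mathbf{u} \geq 0. \tag{7}$$

The KKT condition is

$$\mathbf{X}^\star = \sigma (\mathbf{Z}^\star - \hat{\mathbf{A}} - \mathbf{C}), \tag{8}$$

where we have introduced the notation $\hat{\mathbf{A}} = \sum_{i=1}^{m} u_i \mathbf{A}_i$. Keep
in mind that $\hat{\mathbf{A}}$ is a function of the dual variable $\mathbf{u}$. As in
the case of metric learning, the important observation is that
$\mathbf{Z}$ has an analytical solution when $\mathbf{u}$ is fixed:

$$\mathbf{Z} = (\mathbf{C} + \hat{\mathbf{A}})_+. \tag{9}$$

Therefore we can simplify (7) into

$$\min_{\mathbf{u}} \ \frac{1}{2} \sigma \|(\mathbf{C} + \hat{\mathbf{A}})_-\|_F^2 + \mathbf{b}^\top \mathbf{u}, \quad \text{s.t. } \mathbf{u} \geq 0. \tag{10}$$

So now we can efficiently solve the dual problem using
gradient descent methods. The gradient of the dual function is

$$g(u_i) = \sigma \langle (\mathbf{C} + \hat{\mathbf{A}})_-, \mathbf{A}_i \rangle + b_i, \quad \forall i = 1, \dots, m.$$

At optimality, we have $\mathbf{X}^\star = -\sigma (\mathbf{C} + \hat{\mathbf{A}}^\star)_-$.

The core idea of the proposed method here may be applied
to an SDP which has a term in the form of a Frobenius norm,
either in the objective function or in the constraints.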
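
For the general case, the simplified dual (10) and its gradient translate into a short callback in the same way. The sketch below is illustrative only: the function names and argument layout (`A` of shape `(m, D, D)` holding symmetric $\mathbf{A}_i$, `b` of length `m`) are assumptions, and any bound-constrained gradient method could be used to minimize the returned value subject to $\mathbf{u} \geq 0$.

```python
import numpy as np

def negative_part(S):
    """(S)_- = U diag(min(w, 0)) U^T for a symmetric matrix S."""
    w, U = np.linalg.eigh(S)
    return (U * np.minimum(w, 0.0)) @ U.T

def general_dual_and_grad(u, C, A, b, sigma):
    """Objective of the simplified dual (10) and its gradient in u."""
    A_hat = np.tensordot(u, A, axes=1)             # sum_i u_i A_i
    S_neg = negative_part(C + A_hat)               # (C + A_hat)_-
    f = 0.5 * sigma * np.sum(S_neg ** 2) + b @ u
    # g_i = sigma * <(C + A_hat)_-, A_i> + b_i
    g = sigma * np.tensordot(A, S_neg, axes=([1, 2], [0, 1])) + b
    return f, g

def recover_primal(u, C, A, sigma):
    """Primal solution X = -sigma * (C + A_hat)_- at a dual point u."""
    return -sigma * negative_part(C + np.tensordot(u, A, axes=1))
```

By construction, `recover_primal` always returns a p.s.d. matrix, so the p.s.d. constraint never has to be enforced explicitly during the dual optimization.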

In order to demonstrate the performance of the proposed
general Frobenius norm SDP approach, we will show how
it may be applied to the problem of Maximum Variance
Unfolding (MVU). The MVU optimization problem reads

$$\max_{\mathbf{X}} \ \mathrm{Tr}(\mathbf{X}) \quad \text{s.t. } \langle \mathbf{A}_i, \mathbf{X} \rangle \leq b_i, \ \forall i; \ \mathbf{1}^\top \mathbf{X} \mathbf{1} = 0; \ \mathbf{X} \succeq 0.$$

Here $\{\mathbf{A}_i, b_i\}$, $i = 1, \dots$, encode the local distance constraints.
This problem can be solved using off-the-shelf SDP solvers,
which, as described above, do not scale well. Using
the proposed approach, we modify the objective function
to $\max_{\mathbf{X}} \mathrm{Tr}(\mathbf{X}) - \frac{1}{2\sigma} \|\mathbf{X}\|_F^2$. When $\sigma$ is sufficiently large,
the solution to this Frobenius norm perturbed version is a
reasonable approximation to the original problem. We thus
use the proposed approach to solve MVU approximately.

IV. Experimental results 

We first run metric learning experiments on UCI benchmark 
data, face recognition, and action recognition datasets. We 
then approximately solve the MVU problem [11] using the
proposed general Frobenius norm SDP approach. 

A. Distance metric learning 

1) UCI benchmark test: We perform a comparison between
the proposed FrobMetric and a selection of the current
state-of-the-art distance metric learning methods, including RCA
[6], NCA [5], LMNN [1], BoostMetric [8] and ITML [7] on
data sets from the UCI Repository.

We have included results from PCA, LDA and support
vector machine (SVM) with RBF Gaussian kernel as baseline
approaches. The SVM results were obtained using the LIBSVM
implementation. The kernel and regularization parameters of
the SVMs were selected using cross validation.

As in [1], for some data sets (MNIST, Yale faces and USPS),
we have applied PCA to reduce the original dimensionality and 
to reduce noise. 

For all experiments, the task is to classify unseen instances 
in a testing subset. To accumulate statistics, the data are ran- 
domly split into 10 training/validating/testing subsets, except 
MNIST and Letter, which are already divided into subsets. We 
tuned the regularization parameter in the compared methods 
using cross-validation. In this experiment, about 15% of data 
are used for cross-validation and 15% for testing. 

For FrobMetric and BoostMetric in [8], we use 3-nearest
neighbors to generate triplets and check the performance using
3NN. For each training sample $\mathbf{a}_i$, we find its 3 nearest
neighbors in the same class and the 3 nearest neighbors in
the different classes. With 3 nearest neighbors information,
the number of triplets of each data set for FrobMetric and
BoostMetric are shown in Table I. FrobMetric and BoostMetric
have used exactly the same training information. Note that
other methods do not use triplets as training data. The error
rates based on 3NN and computational time for each learning
metric are shown as well.

^1 LMNN can solve for either $\mathbf{X}$ or the projection matrix $\mathbf{L}$. When LMNN
solves for $\mathbf{X}$ on the "Wine" set, the error rate is 20.77% ± 14.18%.
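
The triplet generation described above can be sketched as follows. The function name `generate_triplets` is hypothetical, and pairing each of the 3 same-class neighbors with each of the 3 different-class neighbors is an assumption about how the neighbors are combined. It yields 9 triplets per training sample, which is consistent with the counts in Table I (for example, 105 training samples and 945 triplets for Iris).

```python
import numpy as np

def generate_triplets(X, y, k=3):
    """Return index triplets (i, j, l) for the rows of X.

    j ranges over the k nearest same-class neighbours of i and
    l over the k nearest neighbours of i from other classes.
    """
    # Pairwise squared Euclidean distances; O(n^2) memory, so this
    # brute-force version is only suitable for small data sets.
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    n = len(X)
    triplets = []
    for i in range(n):
        same = np.flatnonzero((y == y[i]) & (np.arange(n) != i))
        diff = np.flatnonzero(y != y[i])
        same = same[np.argsort(sq[i, same])[:k]]
        diff = diff[np.argsort(sq[i, diff])[:k]]
        triplets.extend((i, j, l) for j in same for l in diff)
    return triplets
```

Each returned triplet `(i, j, l)` corresponds to $(\mathbf{a}_i, \mathbf{a}_j, \mathbf{a}_k)$ in the notation of Section II, with the same-class sample in the second position.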

Experiment settings for LMNN and ITML follow the original
work [1] and [7], respectively. The identity matrix is used
for ITML's initial metric matrix. For NCA, RCA, LMNN,
ITML and BoostMetric, we used the code provided by the
authors. We implemented FrobMetric in Matlab, calling the
Fortran implementation of L-BFGS-B through a Matlab interface. All
computation times are reported on a workstation with 4 Intel
Xeon E5520 (2.27GHz) CPUs (only a single core is used) and
32 GB RAM.

Table I illustrates that the proposed FrobMetric shows error
rates comparable with state-of-the-art methods such as LMNN, 
ITML, and BoostMetric. It also performs on par with a 
nonlinear SVM on these datasets. 

In terms of computation time, FrobMetric is much faster 
than all convex optimization based learning methods (LMNN, 
ITML, BoostMetric) on most data sets. On high-dimensional 
data sets with many data points, as the theory predicts, 
FrobMetric is significantly faster than LMNN. For example, 
on MNIST, FrobMetric is almost 140 times faster. FrobMetric 
is also faster than BoostMetric, although at each iteration 
the computational complexity of BoostMetric is lower. We 
observe that BoostMetric requires significantly more iterations 
to converge. 

Next we use FrobMetric to learn a metric for face recognition
on the "Labeled Faces in the Wild" data set [18].

2) Unconstrained face recognition: In this experiment, we
have compared the proposed FrobMetric to state-of-the-art
methods for the task of face pair-matching on the
"Labeled Faces in the Wild" (LFW) [18] data set. This is a data
set of unconstrained face images, including 13,233 images of
5,749 people collected from news articles on the internet. The
dataset is particularly interesting because it captures much of
the variation seen in real images of faces. The face recognition
task here is to determine whether a presented pair of images
are of the same individual. So we classify unseen pairs according
to whether both images in the pair show the same individual or not, by
applying M$k$NN of [19] instead of $k$NN.

Features of face images are extracted by computing 3-scale,
128-dimensional SIFT descriptors [20], which center on 9
points of facial features extracted by a facial feature descriptor,
as described in [19]. PCA is then performed on the SIFT
vectors to reduce the dimension to between 100 and 400.

Since the proposed FrobMetric method adopts the triplet- 
training concept, we need to use individual's identity infor- 
mation to generate the third example in a triplet, given a pair. 
For matched pairs, we find the third example that belongs to a 
different individual with k nearest neighbors (A: is between 5 
and 30). For mismatched pairs, we find the k nearest neighbors 
(k is between 5 to 30) that have the same identity as one of 
the individuals in the given pair. Some of the generated triplets 
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TABLE I 

Test errors of various metric learning methods on UCI data sets with 3-NN. NCA [5] does not output a result on those larger
data sets due to memory problems. Standard deviation is reported for data sets having multiple runs.

                        MNIST    USPS         letters  Yale faces     Bal           Wine          Iris
# samples               70,000   11,000       20,000   2,414          625           178           150
# triplets              450,000  69,300       94,500   15,210         3,942         1,125         945
dimension               784      256          16       1,024          4             13            4
dimension after PCA     164      60           -        300            -             -             -
# training              50,000   7,700        10,500   1,690          438           125           105
# validation            10,000   1,650        4,500    362            94            27            23
# test                  10,000   1,650        5,000    362            93            26            22
# classes               10       10           26       38             3             3             3
# runs                  1        10           1        10             10            10            10

Error rates (%)
Euclidean               3.19     4.78 (0.40)  5.42     28.07 (2.07)   18.60 (3.96)  28.08 (7.49)  3.64 (4.18)
PCA                     3.10     3.49 (0.62)  -        28.65 (2.18)   -             -             -
LDA                     8.76     6.96 (0.68)  4.44     5.08 (1.15)    12.58 (2.38)  0.77 (1.62)   3.18 (3.07)
SVM                     2.97     2.15 (0.30)  2.96     4.94 (2.14)    5.59 (3.61)   1.15 (1.86)   3.64 (3.59)
RCA [6]                 7.85     5.35 (0.52)  4.64     7.65 (1.08)    17.42 (3.58)  0.38 (1.22)   3.18 (3.07)
NCA [5]                 -        -            -        -              18.28 (3.58)  28.08 (7.49)  3.18 (3.74)
LMNN                    2.30     3.49 (0.62)  3.82     14.75 (12.11)  12.04 (5.59)  3.46 (3.82)   3.64 (2.87)
ITML                    2.80     3.85 (1.13)  7.20     19.39 (2.11)   10.11 (4.06)  28.46 (8.35)  3.64 (3.59)
BoostMetric             2.76     2.53 (0.47)  3.06     6.91 (1.90)    10.11 (3.45)  3.08 (3.53)   3.18 (3.74)
FrobMetric (this work)  2.56     2.32 (0.31)  2.72     9.20 (1.06)    9.68 (3.21)   3.85 (4.44)   3.64 (3.59)

Computational time
LMNN                    11h      20s          1249s    896s           5s            2s            2s
ITML                    1479s    72s          55s      5970s          8s            4s            4s
BoostMetric             9.5h     338s         3s       572s           < 1s          2s            < 1s
FrobMetric              280s     9s           13s      335s           < 1s          < 1s          < 1s



are shown in Figure 1. We select the regularization parameter
using cross validation on View 1, and train and test the metric
using the 10 provided splits in View 2, as suggested by

Simple recognition systems with a single descriptor: Table II
shows the performance of FrobMetric under varying PCA
dimensionality and number of triplets. Increasing the number of
training triplets gives a slight improvement in recognition
accuracy. The dimension after PCA has more impact on the final
accuracy for this task. We also report the CPU time required.

In Figure 2 we show ROC curves for FrobMetric and related
face recognition algorithms. These curves were generated by
altering the threshold value across the distributions of match
and mismatch similarity scores within MkNN. Figure 2 (a)
shows methods that use a single descriptor and a single
classifier only. As can be seen, our system using FrobMetric
outperforms all others.
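
As a rough illustration of how an ROC curve can be traced by sweeping a threshold over match and mismatch scores, consider the following Python sketch. The function names `roc_points` and `auc`, and the convention that a larger score means "more similar", are assumptions made for this example; they are not details taken from the experiments reported here.

```python
import numpy as np

def roc_points(match_scores, mismatch_scores):
    """Sweep a decision threshold over similarity scores.

    A pair is declared a match when its score is >= the threshold.
    Returns (false positive rates, true positive rates), one per threshold.
    """
    match_scores = np.asarray(match_scores, dtype=float)
    mismatch_scores = np.asarray(mismatch_scores, dtype=float)
    # Candidate thresholds: +inf (nothing accepted) followed by every
    # observed score from highest to lowest.
    observed = np.unique(np.concatenate((match_scores, mismatch_scores)))[::-1]
    thresholds = np.concatenate(([np.inf], observed))
    tpr = np.array([(match_scores >= t).mean() for t in thresholds])
    fpr = np.array([(mismatch_scores >= t).mean() for t in thresholds])
    return fpr, tpr

def auc(fpr, tpr):
    """Area under the ROC curve by the trapezoidal rule."""
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))

# Toy scores (hypothetical): matches score higher than mismatches.
fpr, tpr = roc_points([0.9, 0.8, 0.7], [0.1, 0.2, 0.3])
area = auc(fpr, tpr)
```

When distances rather than similarities are available, negating them before calling `roc_points` gives the same curve under this convention.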

Complex recognition systems with one or more descriptors:
Figure 2 (b) plots the performance of more complicated
recognition systems that use hybrid descriptors or combinations
of classifiers. See Table III for details.

As stated above, the leading algorithms have used either 
1) additional appearance information, 2) multiple scores from 
multiple descriptors, or 3) complex recognition systems with 
hybrids of two or more methods. In contrast, our system using 
FrobMetric employs neither a combination of other methods 
nor multiple descriptors. That is, our system exploits a very 




Fig. 1. Generated triplets based on pairwise information provided by the 
LFW data set. The first two belong to the same individual and the third is a 
different individual. 
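
To make the triplet structure in Fig. 1 concrete, the sketch below shows one simple way of turning pairwise "same"/"different" labels into (anchor, same, different) triplets. This is only an assumed construction for illustration: the function name `triplets_from_pairs` and the pairing rule are hypothetical, and the exact procedure used to generate the triplets for LFW is not described in this section.

```python
def triplets_from_pairs(match_pairs, mismatch_pairs):
    """Build (anchor, same, different) index triplets from pairwise labels.

    match_pairs:    iterable of (i, j) known to show the same individual.
    mismatch_pairs: iterable of (i, k) known to show different individuals.
    """
    # Map each image to the images it is known to differ from.
    differs = {}
    for a, b in mismatch_pairs:
        differs.setdefault(a, []).append(b)
        differs.setdefault(b, []).append(a)

    triplets = []
    for a, b in match_pairs:
        # Either member of a matching pair may act as the anchor.
        for anchor, same in ((a, b), (b, a)):
            for other in differs.get(anchor, []):
                triplets.append((anchor, same, other))
    return triplets

# Image 0 matches image 1; image 0 differs from 2, image 1 differs from 3.
example = triplets_from_pairs([(0, 1)], [(0, 2), (1, 3)])
```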
TABLE II

Comparison of the face recognition accuracy (%) and CPU time of the proposed FrobMetric on the LFW datasets,
varying the PCA dimensionality and the number of triplets in each fold for training.

# triplets   100D          200D          300D          400D

Accuracy (%)
3,000        82.10 (1.21)  83.29 (1.59)  83.81 (1.04)  84.08 (1.18)
6,000        82.26 (1.27)  83.55 (1.28)  84.06 (1.06)  83.91 (1.48)
9,000        82.40 (1.30)  83.62 (1.18)  84.08 (0.92)  84.34 (1.23)
12,000       82.50 (1.22)  83.86 (1.18)  84.13 (0.84)  84.19 (1.31)
15,000       82.55 (1.30)  83.70 (1.22)  84.29 (0.77)  84.27 (0.90)
18,000       82.72 (1.24)  83.69 (1.23)  84.20 (0.84)  84.32 (1.45)

CPU time
3,000        51s           215s          373s          937s
6,000        100s          222s          661s          1,312s
9,000        142s          534s          1,349s        3,499s
12,000       186s          647s          1,295s        6,418s
15,000       235s          704s          1,706s        3,616s
18,000       237s          830s          2,342s        7,621s



TABLE III

Test accuracy (%) on LFW datasets. ROC curve labels in Figure 2 are described here with details.

Method                    SIFT or single descriptor + single classifier   Multiple descriptors or classifiers
Turk et al. [21]          60.02 (0.79) 'Eigenfaces'                       -
Nowak et al. [22]         73.93 (0.49) 'Nowak-funneled'                   -
Huang et al. [23]         70.52 (0.60) 'Merl'                             76.18 (0.58) 'Merl+Nowak'
Wolf et al. in 2008 [24]  -                                               78.47 (0.51) 'Hybrid descriptor-based'
Wolf et al. in 2009 [25]  72.02                                           86.83 (0.34) 'Combined b/g samples based methods'
Pinto et al. [26]         79.35 (0.55) 'V1-like/MKL'                      -
Taigman et al. [27]       83.20 (0.77)                                    89.50 (0.40) 'Multishot combined'
Kumar et al. [28]         -                                               85.29 (1.23) 'attribute + simile classifiers'
Cao et al. [29]           81.22 (0.53) 'single LE + holistic'             84.45 (0.46) 'multiple LE + comp'
Guillaumin et al. [19]    83.2 (0.4) 'LDML'                               87.5 (0.4) 'LMNN + LDML'
FrobMetric (this work)    84.34 (1.23) 'FrobMetric' on SIFT               -





simple recognition pipeline. The method thus reduces the 
computational costs associated with extracting the descriptors, 
generating the prior information, training, and computing the 
recognition scores. 

With such a simple metric learning approach, and modest
computational cost, it is notable that the method is only
slightly outperformed by state-of-the-art hybrid systems (test
accuracy of 84.34% ± 1.23% versus 89.50% ± 0.40% on
the LFW datasets). We would expect that the accuracy of
the FrobMetric approach would improve similarly if more
features, such as local binary patterns (LBP) [30] for instance,
were used.



The FrobMetric approach shows better classification perfor- 
mance at a lower computational cost than comparable single 
descriptor methods. Despite this level of performance it is 
surprisingly simple to implement, in comparison to the state- 
of-the-art. 

3) Metric learning for action recognition: In this experiment,
we compare the performance of the proposed method with that of
existing approaches on two action recognition benchmark data
sets, KTH [31] and Weizmann [32]. Some examples of the actions
are shown in Figure 3. We aim to demonstrate again the advantage
of our method in reducing computational overhead while achieving
excellent recognition performance.

[Figure 2: ROC plots; horizontal axis: false positive rate.
Panel (a) curves: Eigenfaces, Nowak-funneled, Merl, V1-like/MKL, LDML, Single LE + holistic, FrobMetric.
Panel (b) curves: Merl+Nowak, Hybrid descriptor-based, LDML+LMNN, Combined b/g samples based methods,
Attribute + Simile classifiers, Multiple LE + comp, Multishot combined, FrobMetric.]

Fig. 2. (a) ROC curves that use a single descriptor and a single classifier, (b)
ROC curves that use hybrid descriptors or single classifiers, together with FrobMetric's
curve. Each point on a curve is the average over the 10 runs.

The KTH dataset in this experiment consists of 2,387 video
sequences. They can be categorized into six types of human
actions: boxing, hand-clapping, jogging, running, walking and
hand-waving. These actions are conducted by 25 subjects and
each action is performed multiple times by the same subject.
The length of each video is about four seconds at 25 fps, and
the resolution of each frame is 160 × 120. We randomly split
all the video sequences based on the subjects into 10 pairs,
each of which contains all the sequences from 16 subjects for
training and those from the remaining 9 subjects for test. The
space-time interest points (STIP) [33] were extracted from each
video sequence and the corresponding descriptors calculated.
The descriptors extracted from all training sequences were
clustered into 4,000 clusters using k-means, with the cluster
centers used to form a visual codebook. Accordingly, each video
sequence is characterized by a 4000-dimensional histogram
indicating the occurrence of each visual word in this sequence.
To achieve a compact





[Figure 3 panels: boxing, handclapping, handwaving, jogging, running, walking.]

Fig. 3. Examples of the actions from the KTH action dataset [31].



and discriminative representation, a recently proposed visual
word merging algorithm, called AIB [34], is applied to merge
the histogram bins to reduce the dimensionality. Subsequently
each video sequence is represented by a 500-dimensional
histogram.
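
The codebook and histogram steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline used in the experiments: the helper names (`kmeans`, `assign`, `bow_histogram`) are hypothetical, a plain Lloyd-style k-means stands in for whichever k-means implementation was used, and the AIB bin-merging step is omitted entirely.

```python
import numpy as np

def assign(X, centers):
    """Index of the nearest center (squared Euclidean distance) per row of X."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def kmeans(X, k, n_iter=20, seed=0):
    """Plain Lloyd iterations; returns the k cluster centers (the codebook)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        labels = assign(X, centers)
        for j in range(k):
            members = X[labels == j]
            if len(members):          # leave empty clusters where they are
                centers[j] = members.mean(axis=0)
    return centers

def bow_histogram(descriptors, centers):
    """Count how often each visual word occurs among a video's descriptors."""
    labels = assign(descriptors, centers)
    return np.bincount(labels, minlength=len(centers))

# Toy data: four 2-D "descriptors" forming two well-separated groups.
X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
codebook = kmeans(X, k=2)
hist = bow_histogram(X, codebook)
```

In the setting of the text, `k` would be 4,000 and the codebook would be built from training descriptors only, with one histogram computed per video sequence.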

The Weizmann data set contains temporal segmentations of
video sequences into ten types of human actions: running,
walking, skipping, jumping-jack, jumping-forward-on-two-legs,
jumping-in-place-on-two-legs, galloping-sideways,
waving-two-hands, waving-one-hand, and bending. The actions
are performed by 9 actors. The action video sequences are
represented by space-time shape features such as space-time
"saliency", degree of "plateness", and degree of "stickness",
which compute the degree of location and orientation movement
in the space-time domain by a Poisson equation [35]. This leads
to a 286-dimensional feature vector for each action video
sequence, as in [35]. In this experiment, 70% of the sequences
are used for training and the remaining 30% for testing.



The experimental results are shown in Table IV. The first
part of the table shows the experimental setting and the second
compares the results of various metric learning methods.
On the KTH data set, the proposed method, FrobMetric, achieves
an error rate of 7.03 ± 1.46%, on par with BoostMetric
(7.05 ± 1.42%), and outperforms all others. In doing so
FrobMetric requires only 289.58 seconds to complete the
metric learning, which is approximately one quarter of the time
required by the fastest competing method (which has more
than double the error rate). On the Weizmann data set, the
error rate of FrobMetric is 0.59 ± 0.20%, which is the
second-best among all the compared methods. It is slightly higher than
(but still comparable to) the lowest one, 0.30 ± 0.09%, obtained
by LMNN. However, in terms of computational efficiency,
FrobMetric requires approximately one eighth of the time
used by LMNN, and is the fastest of the methods compared.
These results demonstrate the computational efficiency and the
excellent classification performance of the proposed method in
action recognition.
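
The error rates discussed above come from a 3-NN classifier applied under each learned metric. As a hedged sketch of that evaluation step, the code below classifies test points by k-NN under a Mahalanobis distance defined by a positive semidefinite matrix `M`. The function name `mahalanobis_knn_predict` and the toy data are assumptions for illustration; how `M` is learned is not shown here.

```python
import numpy as np

def mahalanobis_knn_predict(X_train, y_train, X_test, M, k=3):
    """k-NN prediction under d_M(x, y)^2 = (x - y)^T M (x - y).

    M is assumed symmetric positive semidefinite; labels are assumed to be
    non-negative integers.
    """
    # Factor M = L L^T via its eigen-decomposition (negative eigenvalues,
    # if any arise numerically, are clipped to zero).
    w, V = np.linalg.eigh(M)
    L = V * np.sqrt(np.clip(w, 0.0, None))
    # After mapping x -> L^T x, the Mahalanobis distance is Euclidean.
    A, B = X_train @ L, X_test @ L
    d2 = ((B[:, None, :] - A[None, :, :]) ** 2).sum(axis=2)
    nearest = np.argsort(d2, axis=1)[:, :k]
    votes = y_train[nearest]
    return np.array([np.bincount(v).argmax() for v in votes])

# Toy usage with the identity metric (plain Euclidean 3-NN).
X_train = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                    [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])
y_train = np.array([0, 0, 0, 1, 1, 1])
X_test = np.array([[0.2, 0.2], [5.5, 5.5]])
pred = mahalanobis_knn_predict(X_train, y_train, X_test, np.eye(2), k=3)
```

The test error is then the fraction of entries in `pred` that differ from the true test labels.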

B. Maximum variance unfolding 

In this section, we run MVU experiments on a few datasets
and compare with other embedding methods. Figure 4 shows
the embedding results for several different methods, namely,

TABLE IV

Comparison of FrobMetric and other metric learning methods
on action recognition datasets with 3-NN (standard deviation
is reported for the datasets having multiple runs).

                    KTH            Weizmann
# samples           2,387          5,594
# triplets          13,761         35,280
dimension           500            286
# training          1,529          3,920
# test              858            1,674
# classes           6              10
# runs              10             10

Error rates (%)
Euclidean           10.55 (2.46)   1.14 (0.19)
RCA                 21.05 (3.86)   3.21 (0.66)
LMNN                15.72 (2.57)   0.30 (0.09)
ITML                27.67 (1.47)   1.06 (0.16)
BoostMetric         7.05 (1.42)    0.85 (0.31)
FrobMetric          7.03 (1.46)    0.59 (0.20)

Computational time
LMNN                1023.89s       1343.25s
ITML                1004.94s       368.68s
BoostMetric         4048.67s       1139.02s
FrobMetric          289.58s        169.30s





Fig. 5. Embedding results of our method and MVU on the teapot dataset;
(top) our results with a = 10^^ and (bottom) MVU's results.

Fig. 4. Embedding results of different methods on the 3D swiss-roll dataset,
with the neighborhood size = 6 for all, and a = 10^ for our method: (a)
Isomap, (b) LLE, (c) Our method, (d) MVU.



isometric mapping (Isomap) [36], locally linear embedding
(LLE) [37] and MVU \1T\ on the 3D swiss-roll with 500
points. We use k = 6 nearest neighbors to construct the local
distance constraints and set a = 10^.
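
As an illustration of how local distance constraints can be collected from a k-nearest-neighbor graph, the sketch below returns one (i, j, squared distance) entry per neighboring pair. The function name `local_distance_constraints` and the output format are assumptions made for this example; the text does not specify how the constraints are stored or passed to the solver.

```python
import numpy as np

def local_distance_constraints(X, k=6):
    """List the pairs whose squared distance an MVU-style embedding preserves.

    Returns sorted (i, j, d_ij^2) tuples with i < j, one per pair in which
    j is among the k nearest neighbors of i or vice versa.
    """
    sq = (X ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    d2 = np.maximum(d2, 0.0)            # guard against round-off
    np.fill_diagonal(d2, np.inf)        # a point is not its own neighbor
    neighbors = np.argsort(d2, axis=1)[:, :k]
    pairs = set()
    for i in range(len(X)):
        for j in neighbors[i].tolist():
            pairs.add((min(i, j), max(i, j)))
    return [(i, j, float(d2[i, j])) for i, j in sorted(pairs)]

# Four points on a line, one nearest neighbor each.
X = np.array([[0.0], [1.0], [2.5], [4.5]])
constraints = local_distance_constraints(X, k=1)
```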

We have also applied our method to the teapot and face 



image datasets from fTTl. The teapot set contains 200 images 
obtained by rotating a teapot through 360°. Each image is
of 101 × 76 pixels. Figure 5 shows the two-dimensional
embedding results of our method and MVU. As can be seen,
both methods preserve the order of teapot images corresponding
to the angles from which the images were taken, and
produce plausible embeddings. But in terms of running time,
our algorithm is more than an order of magnitude faster than
MVU, requiring only 4 seconds to run using k = 6 and
a = 10^^, while MVU required 85 seconds.

Figure 6 shows a two-dimensional embedding of the images
from the face dataset. The set contains 1,965 images (at 28 ×
20 pixels) of the same individual from different views and
with differing expressions. The proposed method required 131
seconds to solve this problem using k = 5 nearest neighbors,
whereas the original MVU needed 4732 seconds.

1) Quantitative Assessment: To better illustrate the effectiveness
of our method, here we provide a quantitative evaluation
of the embeddings generated for the 3D swiss-roll and
teapot datasets. Specifically, we adopt two quality mapping
indexes, the unweighted Q_NX and B_NX [38], to measure the
K-ary neighborhood preservation between the high and low
dimensional spaces. Q_NX represents the proportion of points
that remain inside the K-neighborhood after projection, and
thus larger Q_NX indicates better neighborhood preservation.
B_NX is defined as the difference between the fractions of mild
K-intrusions and mild K-extrusions. It indicates the "behavior"
of a dimensionality reduction method, namely, whether it
tends to produce an "intrusive" (B_NX(K) > 0) or "extrusive"
(B_NX(K) < 0) embedding. An intrusive embedding tends to

Fig. 6. 2D embedding of face data by our approach.



crush the manifold, which means faraway points can become
neighbors after embedding, while an extrusive one tends to tear
the manifold, meaning some close neighbors can be embedded
far away from each other. In an ideal projection, B_NX should
be zero. See [38] for more details.
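
The two indexes can be computed from neighbor ranks in the original space and in the embedding. The following is a minimal sketch under stated assumptions: the function names `rank_matrix` and `qnx_bnx` are hypothetical, distances are taken to be Euclidean in both spaces, and the sign of B_NX is chosen so that positive values indicate an intrusive embedding, in line with the convention stated in the text (fraction of mild K-intrusions minus fraction of mild K-extrusions).

```python
import numpy as np

def rank_matrix(X):
    """ranks[i, j] = neighbor rank of point j as seen from point i.

    Rank 0 is the point itself, rank 1 its nearest neighbor, and so on.
    """
    sq = (X ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    np.fill_diagonal(d2, -np.inf)                 # force self to rank 0
    order = np.argsort(d2, axis=1, kind="stable")
    ranks = np.empty_like(order)
    rows = np.arange(len(X))[:, None]
    ranks[rows, order] = np.arange(len(X))[None, :]
    return ranks

def qnx_bnx(X_high, X_low, K):
    """Return (Q_NX(K), B_NX(K)) for an embedding X_low of X_high."""
    rho = rank_matrix(X_high)     # ranks in the original space
    r = rank_matrix(X_low)        # ranks in the embedding
    n = len(X_high)
    # Pairs that are K-neighbors both before and after projection.
    inside = (rho >= 1) & (rho <= K) & (r >= 1) & (r <= K)
    q = inside.sum() / (K * n)
    mild_intrusions = (inside & (r < rho)).sum() / (K * n)
    mild_extrusions = (inside & (r > rho)).sum() / (K * n)
    return float(q), float(mild_intrusions - mild_extrusions)

# Sanity check: an embedding identical to the data preserves every rank.
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 3))
q_same, b_same = qnx_bnx(X, X, K=5)
```

Evaluating `qnx_bnx` over a range of K values gives curves of the kind plotted against K in the quality assessment figures.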

The comparisons of LLE, Isomap, MVU and our proposed
method on the swiss roll (a = 10^) and the teapot (with
a = 10^^) datasets are shown in Figure 7 and Figure 8,
respectively. As can be seen from Figure 7 (a) and Figure 8 (a),
the proposed FrobMetric method performs on par with MVU, and
better than both Isomap and LLE, in terms of neighborhood
preservation. Note that all methods tend to tear the manifold,
as B_NX(K) is below zero in all cases.

We have also made a quantitative analysis of the proposed
algorithm based on Zhang et al. [39]. They proposed several
quantitative criteria, specifically, average local standard
deviation (ALSTD) and average local extreme deviation
(ALED) to measure the global smoothness of a recovered
low-dimensional manifold; average local co-directional consistence
(ALCD) to estimate the average co-directional consistence
of the principal spread direction (PSD) of the data points;
and a combined criterion to simultaneously evaluate the global
smoothness and co-directional consistence (GSCD).

We give the visual results of the swiss-roll dataset based on
PSD in Figure 9, in which the longer line at each sample
represents the first PSD, and the second line is orthogonal to
the first PSD. We also report the ALSTD and ALED, ALCD
and GSCD in Table V and Table VI. From the tables we see




Fig. 7. Quality assessment of neighborhood preservation of different
algorithms on 3D swiss roll; (a) Q_NX(K); (b) B_NX(K).



that MVU performs best on this swiss-roll dataset, while the
proposed FrobMetric method ranks second. On the
teapot dataset, the proposed method performs slightly better 
than MVU, while worse than both Isomap and LLE. Overall, 
the proposed method is similar to the original MVU in terms 
of these embedding quality criteria. However, note that the 
proposed method is much faster than MVU in all cases. 

TABLE V

ALSTD and ALED results on the 3D swiss roll and teapot datasets

Criterion  Dataset      FrobMetric  MVU     Isomap  LLE
ALSTD      Swiss roll   0.113       0.038   0.245   0.328
ALSTD      Teapot       0.377       0.481   0.0611  0.0805
ALED       Swiss roll   0.311       0.102   0.668   0.886
ALED       Teapot       1.0         1.28    0.173   0.232



To show the efficiency of our approach, in Figure 10 we
have compared the computational time between the original
MVU implementation and the proposed method, by varying
the number of data samples, which determines the number of
variables in MVU. Note that the original MVU implementation
uses CSDP [40], which is an interior-point based Newton
algorithm. We use the 3D "swiss-roll" data here.

Fig. 8. Quality assessment of neighborhood preservation of different
algorithms on teapot dataset; (a) Q_NX(K); (b) B_NX(K).

[Figure 9 panels: original data, (a), (b), (c), (d).]

Fig. 9. Visualization results based on PSD on 3D swiss roll, with the
neighborhood size = 6 for all, and a = 10^ for our method; (a) Isomap,
(b) LLE, (c) our method, (d) MVU.



TABLE VI

ALCD and GSCD results on the 3D swiss roll and teapot datasets

Criterion  Dataset      FrobMetric  MVU     Isomap  LLE
ALCD       Swiss roll   0.983       0.964   0.969   0.994
ALCD       Teapot       0.995       0.942   0.997   0.995
GSCD       Swiss roll   0.1163      0.0404  0.2572  0.3317
GSCD       Teapot       0.3792      0.5105  0.0613  0.0808



V. Conclusion 

We have presented an efficient and scalable semidefinite 
metric learning algorithm. Our algorithm is simple to imple- 
ment and much more scalable than most SDP solvers. The 
key observation is that, instead of solving the original primal 
problem, we solve the Lagrange dual problem by exploiting 
its special structure. Experiments on UCI benchmark data sets 
as well as the unconstrained face recognition task show its 
efficiency and efficacy. We have also extended it to solve more 
general Frobenius norm regularized SDPs. 